Abstract

We consider two ways that one might convert a prediction of sea surface temperature (SST) into a 
prediction of landfalling hurricane numbers. First, one might regress historical numbers of landfalling 
hurricanes onto historical SSTs, and use the fitted regression relation to predict future landfalling 
hurricane numbers given predicted SSTs. We call this the direct approach. Second, one might regress 
basin hurricane numbers onto historical SSTs, estimate the proportion of basin hurricanes that make 
landfall, and use the fitted regression relation and estimated proportion to predict future landfalling 
hurricane numbers. We call this the indirect approach. Which of these two methods is likely to work 
better? We answer this question for two simple models. The first model is reasonably realistic, but 
we have to resort to using simulations to answer the question in the context of this model. The second 
model is less realistic, but allows us to derive a general analytical result. 



1 Introduction 

There is a great need to predict the distribution of the number of hurricanes that might make landfall in 
the US in the next few years. Such predictions are of use to all the entities that are affected by hurricanes, 
ranging from local and national governments to insurance and reinsurance companies. How, then, should 
we make such predictions? There is no obvious best method. For instance, one might consider making a 
prediction based on time-series analysis of the time-series of historical landfalling hurricane numbers; one 
might consider making a prediction of basin hurricane numbers using time-series analysis, and convert 
that prediction to a prediction of landfalling hurricane numbers; one might consider trying to predict 
SSTs first, and convert that prediction to a prediction of landfalling numbers; or one might try to use
output from a numerical model of the climate system. All of these are valid approaches, and each has
its own pros and cons.

In this article, we consider the idea of first predicting SST and then predicting hurricane numbers given 
a prediction of SST. There are two obvious flavours of this. The first is what we will call the 'direct' 
(or 'one-step') method, in which one regresses historical numbers of landfalling hurricanes directly onto 
historical SSTs, and uses the fitted regression relation to convert a prediction of future SSTs into a 
prediction of future hurricane numbers. The second is what we will call the 'indirect' (or 'two-step') 
method, in which one regresses basin hurricane numbers onto historical SSTs, predicts basin numbers, 
and then predicts landfalling numbers from basin numbers. In the simplest version of the indirect method 
one might predict landfalling numbers as a constant proportion of the number of basin hurricanes, where 
this proportion is estimated using historical data. 

Consideration of the direct and indirect SST-based methods motivates the question: at a theoretical level,
which of these two methods is likely to work best? This is a statistical question about the properties 
of regression and proportion models. We consider this abstract question in the context of two simple 
models. The first model is the more realistic of the two. It uses observed SSTs, models the mean 
number of hurricanes in the basin as a linear function of SST, and models each basin hurricane as having 
a constant probability of making landfall. We run simulations that allow us to directly compare the 
performance of the direct and indirect methods in the context of this model. The second model is less 
realistic, but allows us to derive a general analytical result for the relative performance of the direct and 
indirect methods. In this model we represent SST, basin and landfalling hurricane numbers as being 
normally distributed and linearly related. 

We don't think the answer as to which of the direct or indirect methods is better is a priori obvious. 
On the one hand, the direct method has fewer parameters to estimate, which might work in its favour. 
On the other hand, the indirect method allows us to use more data by incorporating the basin hurricane 
numbers into the analysis. 

Section 2 describes the methods used in the simulation study, and section 3 describes the results from
that study. In section 4 we derive general analytic results for the linear-normal model. Finally, in section 5
we discuss our results.



2 Simulation-based analysis: methods 

For our simulation study, we compare the direct and indirect methods described above as follows. 



2.1 Generating artificial basin hurricane numbers 

First, we simulate 10,000 sets of artificial basin hurricane numbers for the period 1950-2005, giving a total 
of 10,000 x 56 = 560,000 years of simulated hurricane numbers. These numbers are created by sampling 
from Poisson distributions with mean given by:

$$\lambda = \alpha + \beta S \tag{1}$$

where $S$ is the observed MDR SST for each year in the period 1950-2005. The values of $\alpha$ and $\beta$ are
derived from model 4 in table 7 in Binter et al. (2007), in which observed basin hurricane numbers were
regressed onto observed SSTs using data for 1950-2005. They have values of 6.25 and 5, respectively.

The basin hurricane numbers we create by this method should contain roughly the same long-term
SST-driven variability as the observed basin hurricane numbers, but different numbers of hurricanes in the
individual years. We say 'roughly' the same, because (a) the linear model we are using to relate SST to
hurricane numbers is undoubtedly not exactly correct, although given the analysis in Binter et al. (2007)
it certainly seems to be reasonable, and (b) the parameters of the linear model are only estimated.
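
The sampling step can be sketched in Python with NumPy (our choice of tool for illustration; the study does not say what software was used). The SST series below is a hypothetical placeholder, treated as anomalies so that the Poisson mean stays positive; the study itself uses observed MDR SSTs, which are not reproduced here. Only the values 6.25 and 5 and the array sizes come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

n_sets, n_years = 10_000, 56           # 10,000 sets covering 1950-2005
alpha, beta = 6.25, 5.0                # values quoted in the text

# Hypothetical stand-in for the observed SST series (assumed to be anomalies).
sst = np.linspace(-0.5, 0.5, n_years)

lam = alpha + beta * sst               # Poisson mean for each year, equation (1)
basin = rng.poisson(lam, size=(n_sets, n_years))   # one row per simulated set
```

Each row of `basin` is then one alternative 56-year history that shares the same SST-driven mean.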



2.2 Generating artificial landfalling hurricane numbers 

Given the 10,000 sets of simulated basin hurricane numbers described above, we then create 10,000 sets of 
simulated landfalling hurricane numbers by applying the rule that each basin hurricane has a probability 
of 0.254 of making landfall (this value is taken from observed data for 1950-2005). 

The landfalling hurricane numbers we create by this method should contain roughly the same long-term
SST-driven variability as the observed landfalling series, but different numbers of hurricanes in the
individual years. They should also contain roughly the right dependency structure between the number
of hurricanes in the basin and the number at landfall (e.g. that years with more hurricanes in the basin
will tend to have more hurricanes at landfall).
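
Applying a constant landfall probability to each basin hurricane independently is a binomial draw, which can be sketched as follows. The basin counts here are a simplified stand-in with a constant Poisson mean (an assumption made to keep the example self-contained); only the probability 0.254 comes from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
p_landfall = 0.254                     # probability quoted in the text

# Assumption: simplified stand-in for the simulated basin counts (sets x years).
basin = rng.poisson(6.25, size=(10_000, 56))

# Each basin hurricane makes landfall independently with probability p_landfall,
# so the landfall count given the basin count is binomial.
landfall = rng.binomial(basin, p_landfall)
```

By construction no simulated year can have more landfalls than basin hurricanes, which is the dependency structure described above.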



2.3 Making predictions 



We now have 10,000 sets of 56 years of artificial data for basin and landfalling hurricanes. This data
contains a realistic representation of the SST-driven variability of hurricane numbers, and of the dependency
structure between the numbers of hurricanes in the basin and at landfall, but different actual numbers 
of hurricanes from the observations. We can consider this data as 10,000 realisations of what might 
have occurred over the last 56 years, had the SSTs been the same, but the evolution of the atmosphere 
different. This data is a test-bed that can help us understand aspects of the predictability of landfalling 
hurricanes given SST. 

The observed and simulated data is illustrated in figures 1 to 5. Figure 1 shows the observed basin
data (solid black line) and the observed landfall data (solid grey line). The dashed black line shows 
the variability in the observed basin data that is explained using SSTs. The dotted grey line shows the 
variability in the observed landfall data that is explained using SSTs using the direct method, and the 
dotted grey line shows the variability in the landfall data that is explained using SSTs using the indirect 
method. 

Figures 2 to 5 show 4 realisations of the simulated data. In each figure the dotted and dashed lines are
the same as in figure 1, and show the SST-driven signal. The solid black line then shows the simulated
basin hurricane numbers and the solid grey line shows the simulated landfalling hurricane numbers. 

We test predictions of landfalling hurricane numbers using the direct method as follows: 

• we loop through the 10,000 sets of simulated landfalling hurricanes 

• for each set, we miss out one of the 56 years 

• using the other 55 years in that set, we build a linear regression model between SST and landfalling 
hurricane numbers 

• we then use that fitted model to predict the number of landfalling hurricanes in the missed year, 
given the SST for that year 

• we calculate the error for that prediction 

• we then repeat for all 10,000 sets (missing out a different year each time) 

• this gives us 10,000 prediction errors, from which we calculate the RMSE 

We test the indirect method in almost exactly the same way, except that this time we also fit a model 
for predicting landfalling numbers from basin numbers. 
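
The leave-one-out test described above might be sketched as below. Several details are assumptions rather than statements of what was done in the study: the SST series is a hypothetical placeholder, the held-out year is drawn at random for each set, the indirect method estimates the landfall proportion as total landfalls divided by total basin hurricanes over the 55 retained years, and only 2,000 sets are used to keep the example quick.

```python
import numpy as np

def fit_line(x, y):
    """Ordinary least squares fit of y = a + b*x; returns (a, b)."""
    xm, ym = x.mean(), y.mean()
    b = np.sum((x - xm) * (y - ym)) / np.sum((x - xm) ** 2)
    return ym - b * xm, b

def loo_errors(sst, basin, landfall, rng):
    """Hold out one year per set; return (direct, indirect) prediction errors."""
    n_sets, n_years = basin.shape
    err_direct = np.empty(n_sets)
    err_indirect = np.empty(n_sets)
    for i in range(n_sets):
        k = rng.integers(n_years)               # year to miss out for this set
        keep = np.arange(n_years) != k          # the other 55 years
        # Direct: regress landfalling numbers straight onto SST.
        a, b = fit_line(sst[keep], landfall[i, keep])
        err_direct[i] = (a + b * sst[k]) - landfall[i, k]
        # Indirect: regress basin numbers onto SST, then scale the basin
        # prediction by the estimated landfall proportion.
        a, b = fit_line(sst[keep], basin[i, keep])
        p_hat = landfall[i, keep].sum() / basin[i, keep].sum()
        err_indirect[i] = p_hat * (a + b * sst[k]) - landfall[i, k]
    return err_direct, err_indirect

rng = np.random.default_rng(2)
sst = np.linspace(-0.5, 0.5, 56)                # hypothetical SST anomalies
basin = rng.poisson(6.25 + 5.0 * sst, size=(2_000, 56))
landfall = rng.binomial(basin, 0.254)
err_direct, err_indirect = loo_errors(sst, basin, landfall, rng)
```

Both methods are scored on exactly the same held-out years, so any difference in the errors comes from the fitting method rather than from the choice of test cases.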



2.4 Comparing the predictions 

We compare the direct and indirect predictions in two ways: 

• First, we compare the two RMSE values 

• Second, we count what proportion of the time the errors from the direct method are smaller than 
the errors from the indirect method 

We also repeat the entire calculation a number of times as a rough way to evaluate the convergence of 
our results. 
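
The two comparison measures can be computed from the prediction errors as in the following sketch. The function name `compare` and the toy error values are illustrative only, and counting ties for neither method is an assumption.

```python
import numpy as np

def compare(err_direct, err_indirect):
    """Return (RMSE direct, RMSE indirect, fraction of cases where direct is closer)."""
    err_direct = np.asarray(err_direct, dtype=float)
    err_indirect = np.asarray(err_indirect, dtype=float)
    rmse_direct = np.sqrt(np.mean(err_direct ** 2))
    rmse_indirect = np.sqrt(np.mean(err_indirect ** 2))
    # Ties (equal absolute errors) are counted as a win for neither method.
    frac_direct_wins = np.mean(np.abs(err_direct) < np.abs(err_indirect))
    return rmse_direct, rmse_indirect, frac_direct_wins

# Toy errors, purely illustrative.
rmse_d, rmse_i, frac = compare([1.0, -2.0, 0.5, 0.0], [0.5, -1.0, 1.0, 0.0])
```

For these toy values the indirect errors have the smaller RMSE (0.75), and the direct prediction is closer in one case out of four.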



3 Simulation-based analysis: results 



We now present the results from our simulation study. The RMSE for the direct method is 1.61 hurricanes, 
while the RMSE for the indirect method is 1.58 hurricanes. This difference is small, but the sign of it 
does appear to be real: when we repeat the whole experiment a number of times, we always find that 
the indirect method beats the direct method. 



The indirect method beats the direct method 51.8% of the time. 



Given the design of the experiment, these results tell us how the two methods perform, on average over 
the whole range of SST values. Next year's SST, however, is likely to be warm relative to historical SSTs. 
We therefore also consider the more specific question of how the methods are likely to perform for given
warm SSTs. Based on Laepple et al. (2007), we fit a linear trend to the historical SSTs, and extrapolate
this trend out to 2011. This then gives SST values that are warmer than anything experienced in history 
(27.987°C to be precise). We then repeat the whole analysis for predictions for this warm SST only. 
The results are more or less as before: the indirect method still wins, only this time by a slightly larger 
margin. The ratio of RMSE scores (direct divided by indirect) increases from 1.02 to 1.04. 
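
The trend extrapolation step can be illustrated with a least-squares straight line. The SST series below is hypothetical (an exact linear trend with made-up coefficients), so the extrapolated value is not the 27.987°C quoted above; the sketch only shows the mechanics.

```python
import numpy as np

years = np.arange(1950, 2006)                  # 1950-2005 inclusive, 56 years
# Hypothetical SST series: an exact linear trend, for illustration only.
sst = 27.0 + 0.01 * (years - 1950)

t = years - 1950                               # shift the time origin to keep the fit well conditioned
slope, intercept = np.polyfit(t, sst, deg=1)   # least-squares straight line
sst_2011 = intercept + slope * (2011 - 1950)   # extrapolate the trend to 2011
```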



4 The linear-normal case



We now study a slightly less realistic model, in which we take SSTs and hurricane numbers in the basin 
and at landfall to be normally distributed. These changes allow us to derive a very general result for the 
relative performance of the direct and indirect methods. 



4.1 The setup 

Here's how we set the problem up in this case. 

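Consider two simple regression models for centred random variables $Y$ and $Z$,

\begin{align*}
Y &= X\beta + \epsilon, & \epsilon &\sim (0, \sigma_\epsilon^2 I_n), \\
Z &= Y\gamma + \eta, & \eta &\sim (0, \sigma_\eta^2 I_n),
\end{align*}

where $\epsilon$ and $\eta$ are independent. Here $X$, $Y$, $Z$, $\epsilon$ and $\eta$ are $n \times 1$ column vectors, $\beta$ and $\gamma$ are scalars,
and $I_n$ is the $n \times n$ identity matrix. We will assume $X$ is fixed.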

In relation to the hurricane problem, X is the time-series of n years of SST values, Y is the time-series of 
n years of basin hurricane numbers and Z is the time-series of n years of landfalling hurricane numbers. 
Note that in our notation X is the whole time-series of SST, written as a vector, and similarly for Y and 
Z. Using vector notation avoids the messy use of subscripts. Two immediate comments about this setup: 
(a) we are assuming that basin and landfalling hurricane numbers are normally distributed. This doesn't 
really make sense, since they are counts that can only take integer values: using a Poisson distribution
would make more sense. We are starting off by addressing this question for normally distributed data 
because it's more tractable that way; (b) we are assuming a linear relationship (with offset and slope) 
between basin hurricanes and landfalling hurricanes. This is also a little odd, since there is no reason 
to have an offset in this relationship: if there aren't any basin hurricanes, there can't be any landfalling 
hurricanes. The most obvious model would be that each hurricane has a constant probability of making
landfall. Again, we are starting off by addressing this question in a linear context because it's more 
tractable that way. 

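We want to know about the accuracy of forecasts that we might make with the direct and indirect
methods. This translates mathematically into saying that we want to estimate

\begin{align}
E(z_{n+1}) &= E(y_{n+1})\gamma \tag{2} \\
&= x_{n+1}\beta\gamma \tag{3} \\
&= x_{n+1}\delta \tag{4}
\end{align}

where $\delta = \beta\gamma$.

The problem then boils down to measuring the quality of the estimator of $\delta$ since, if $\hat{z}_{n+1} = x_{n+1}\hat{\delta}$ is an
estimator of $E(z_{n+1})$ then

\begin{align}
\mathrm{MSE}(\hat{z}_{n+1}) &= \mathrm{MSE}(x_{n+1}\hat{\delta}) \tag{5} \\
&= E[(x_{n+1}\hat{\delta} - x_{n+1}\delta)(x_{n+1}\hat{\delta} - x_{n+1}\delta)'] \tag{6} \\
&= x_{n+1}\,\mathrm{MSE}(\hat{\delta})\,x_{n+1}'. \tag{7}
\end{align}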

So we now consider the direct and indirect methods for estimating $\delta$.

4.2 Direct estimator of $\delta$

We start by considering the direct, or one-step, method. This means we consider the relationship between
X and Z, ignoring Y. The usual OLS estimator for $\delta$ is

\begin{align}
\hat{\delta}^{\dagger} &= (X'X)^{-1}X'Z \tag{8} \\
&= (X'X)^{-1}X'(X\beta\gamma + \epsilon\gamma + \eta) \tag{9} \\
&= \delta + (X'X)^{-1}X'(\epsilon\gamma + \eta). \tag{10}
\end{align}

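What are the statistical properties of this estimator?
In terms of mean:

$$E(\hat{\delta}^{\dagger}) = \delta \tag{11}$$

i.e. the estimator is unbiased.
In terms of variance:

$$\mathrm{Var}(\hat{\delta}^{\dagger}) = (X'X)^{-1}X'\,\mathrm{Var}(\epsilon\gamma + \eta)\,X(X'X)^{-1}. \tag{12}$$

We know that $\mathrm{Var}(\epsilon\gamma + \eta) = \sigma_\epsilon^2 I_n \gamma^2 + \sigma_\eta^2 I_n$, so

$$\mathrm{Var}(\hat{\delta}^{\dagger}) = (X'X)^{-1}(\sigma_\epsilon^2\gamma^2 + \sigma_\eta^2). \tag{13}$$

By equation 7 this then gives us an expression for the performance of the direct method.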
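
As a numerical sanity check, the variance in equation 13 can be compared with a Monte Carlo estimate. All parameter values and the regressor X below are hypothetical, chosen only to exercise the formula, and the errors are drawn as normal in line with the linear-normal setup.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical parameter values, chosen only to exercise the formula.
n, beta, gamma = 56, 5.0, 0.25
sigma_eps, sigma_eta = 2.5, 1.2
X = np.linspace(-0.5, 0.5, n)              # fixed, centred regressor
xtx = X @ X                                # the scalar X'X

n_reps = 20_000
eps = rng.normal(0.0, sigma_eps, size=(n_reps, n))
eta = rng.normal(0.0, sigma_eta, size=(n_reps, n))
Y = X * beta + eps                         # one replicate per row
Z = Y * gamma + eta

delta_direct = (Z @ X) / xtx               # OLS slope of Z on X, equation (8)

var_mc = delta_direct.var()                # Monte Carlo variance of the estimator
var_theory = (sigma_eps**2 * gamma**2 + sigma_eta**2) / xtx   # equation (13)
```

With this many replicates `var_mc` and `var_theory` should agree to within a few percent, and the mean of `delta_direct` should sit close to `beta * gamma`, consistent with unbiasedness.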

4.3 Indirect estimator of $\delta$

We now consider the indirect, or two-step, method. This means considering the relationships between X
and Y, and Y and Z.

First, we consider estimating each regression separately. The OLS estimators for the slopes in each case
are:

\begin{align}
\hat{\beta} &= (X'X)^{-1}X'Y \tag{14} \\
&= \beta + (X'X)^{-1}X'\epsilon \tag{15} \\
\hat{\gamma} &= (Y'Y)^{-1}Y'Z \tag{16} \\
&= \gamma + (Y'Y)^{-1}Y'\eta \tag{17}
\end{align}



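We now put the two models together, to create a single regression model based on the separate estimates
for the two steps. We call the estimate of the slope of this combined model $\hat{\delta}$. Combining the expressions
above, we have that:

\begin{align}
\hat{\delta} &= \hat{\beta}\hat{\gamma} \tag{18} \\
&= \beta\gamma + (X'X)^{-1}X'\epsilon\gamma + \beta(Y'Y)^{-1}Y'\eta + (X'X)^{-1}X'\epsilon(Y'Y)^{-1}Y'\eta \tag{19}
\end{align}

What are the statistical properties of this estimator $\hat{\delta}$?

It is clear (by independence of $\epsilon$ and $\eta$) that $\hat{\delta}$ is unbiased:

\begin{align}
E(\hat{\delta}) &= \beta\gamma \tag{20} \\
&= \delta \tag{21}
\end{align}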



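The variance is more awkward. Note that if $\epsilon$ were known then $\hat{\beta}$ and $Y$ would be fixed constants. Thus,

\begin{align}
E(\hat{\delta} \mid \epsilon) &= E(\hat{\beta}\hat{\gamma} \mid \epsilon) \tag{22} \\
&= \hat{\beta}E(\hat{\gamma} \mid \epsilon) \tag{23} \\
&= \hat{\beta}\gamma, \tag{24} \\
\mathrm{Var}(\hat{\delta} \mid \epsilon) &= \mathrm{Var}(\hat{\beta}\hat{\gamma} \mid \epsilon) \tag{25} \\
&= \hat{\beta}\,\mathrm{Var}(\hat{\gamma} \mid \epsilon)\,\hat{\beta}' \tag{26} \\
&= \hat{\beta}(Y'Y)^{-1}\hat{\beta}'\sigma_\eta^2 \tag{27}
\end{align}

and so

\begin{align}
\mathrm{Var}(\hat{\delta}) &= \mathrm{Var}(\hat{\beta}\hat{\gamma}) \tag{28} \\
&= E[\mathrm{Var}(\hat{\beta}\hat{\gamma} \mid \epsilon)] + \mathrm{Var}[E(\hat{\beta}\hat{\gamma} \mid \epsilon)] \tag{29} \\
&= E[\hat{\beta}(Y'Y)^{-1}\hat{\beta}']\sigma_\eta^2 + \gamma\,\mathrm{Var}(\hat{\beta})\,\gamma', \tag{30}
\end{align}

where we have used a standard relation for disaggregating the variance:

$$\mathrm{var}(a) = E[\mathrm{var}(a \mid b)] + \mathrm{var}[E(a \mid b)] \tag{31}$$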



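Using the facts that

\begin{align}
E(Y'Y) &= \beta'X'X\beta + n\sigma_\epsilon^2 \tag{32} \\
E(\hat{\beta}\hat{\beta}') &= \beta\beta' + (X'X)^{-1}\sigma_\epsilon^2 \tag{33}
\end{align}

and approximating to second order:

$$\mathrm{Var}(\hat{\delta}) = \left[\frac{\beta^2 + q^2}{\beta^2 + nq^2}\right](X'X)^{-1}\sigma_\eta^2 + q^2\gamma^2 \tag{34}$$

where $q^2 = (X'X)^{-1}\sigma_\epsilon^2$.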
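
One way to read the approximation step, assuming that the expectation of the ratio in equation 30 is replaced by the ratio of the expectations in equations 33 and 32 (this intermediate step is not spelled out in the text, so it is a reconstruction), and using the fact that $\beta$ and $\gamma$ are scalars:

```latex
\begin{align*}
E\!\left[\hat{\beta}(Y'Y)^{-1}\hat{\beta}'\right]
  &\approx \frac{E(\hat{\beta}\hat{\beta}')}{E(Y'Y)}
   = \frac{\beta^2 + (X'X)^{-1}\sigma_\epsilon^2}{\beta^2 X'X + n\sigma_\epsilon^2}
   = \frac{\beta^2 + q^2}{\beta^2 + nq^2}\,(X'X)^{-1}, \\
\gamma\,\mathrm{Var}(\hat{\beta})\,\gamma'
  &= \gamma^2 (X'X)^{-1}\sigma_\epsilon^2 = q^2\gamma^2 .
\end{align*}
```

Multiplying the first line by $\sigma_\eta^2$ and adding the second gives the two terms of equation 34.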



4.4 Comparing the two estimators 



We are now in a position to compare the estimators for the direct and indirect methods. Subtracting
equation 34 from equation 13 gives:

\begin{align}
\mathrm{Var}(\hat{\delta}^{\dagger}) - \mathrm{Var}(\hat{\delta}) &= (X'X)^{-1}(\sigma_\epsilon^2\gamma^2 + \sigma_\eta^2) - \left[\frac{\beta^2 + q^2}{\beta^2 + nq^2}(X'X)^{-1}\sigma_\eta^2 + (X'X)^{-1}\sigma_\epsilon^2\gamma^2\right] \tag{35} \\
&= (X'X)^{-1}\sigma_\eta^2 - \frac{\beta^2 + q^2}{\beta^2 + nq^2}(X'X)^{-1}\sigma_\eta^2 \tag{36} \\
&= \left[1 - \frac{\beta^2 + q^2}{\beta^2 + nq^2}\right](X'X)^{-1}\sigma_\eta^2 \tag{37} \\
&= \frac{(n-1)q^2}{\beta^2 + nq^2}(X'X)^{-1}\sigma_\eta^2 \tag{38}
\end{align}

The right hand side of this equation is clearly positive for n > 1.
This indicates:

• that using the indirect method is an improvement on the direct method, at least up to our second
order approximations

• that if $\beta/q$ is small or $\sigma_\eta^2$ large then using the indirect method provides a marked improvement over
the direct approach
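
The comparison can also be explored numerically. The sketch below evaluates equations 13, 34 and 38 and runs a small Monte Carlo experiment for the two estimators. The function names and all parameter values are hypothetical, and the regressions are fitted through the origin because the variables are taken to be centred.

```python
import numpy as np

def var_direct(xtx, gamma, s2_eps, s2_eta):
    """Variance of the direct estimator, equation (13)."""
    return (s2_eps * gamma**2 + s2_eta) / xtx

def var_indirect(xtx, n, beta, gamma, s2_eps, s2_eta):
    """Approximate variance of the indirect estimator, equation (34)."""
    q2 = s2_eps / xtx
    return (beta**2 + q2) / (beta**2 + n * q2) * s2_eta / xtx + q2 * gamma**2

def gain(xtx, n, beta, s2_eps, s2_eta):
    """Var(direct) minus Var(indirect), equation (38)."""
    q2 = s2_eps / xtx
    return (n - 1) * q2 / (beta**2 + n * q2) * s2_eta / xtx

# Hypothetical values, chosen only for illustration.
n, beta, gamma, s2_eps, s2_eta = 56, 5.0, 0.25, 2.5**2, 1.2**2
X = np.linspace(-0.5, 0.5, n)              # fixed, centred regressor
xtx = X @ X

# Monte Carlo versions of the two estimators (one replicate per row).
rng = np.random.default_rng(4)
n_reps = 20_000
Y = X * beta + rng.normal(0.0, np.sqrt(s2_eps), size=(n_reps, n))
Z = Y * gamma + rng.normal(0.0, np.sqrt(s2_eta), size=(n_reps, n))
delta_direct = (Z @ X) / xtx                                  # equation (8)
beta_hat = (Y @ X) / xtx                                      # equation (14)
gamma_hat = np.sum(Y * Z, axis=1) / np.sum(Y * Y, axis=1)     # equation (16)
delta_indirect = beta_hat * gamma_hat                         # equation (18)
```

Comparing `delta_indirect.var()` with `var_indirect(...)` gives a feel for how tight the approximation in equation 34 is for a given choice of parameters; the simulated variance need not match it exactly.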



5 Conclusions 



We have compared the likely performance of direct and indirect methods for predicting landfalling
hurricane numbers from SST. The direct method is based on building a linear regression model directly from
SST to landfalling hurricane numbers. The indirect method is based on building a regression model from 
SST to basin numbers, and then predicting landfalling numbers from basin numbers using a constant 
proportion. 

First, we compare these two methods in the context of a reasonably realistic model, using simulations. 
We find that the indirect method is better than the direct method, but that the difference is small. 

Secondly, we compare the two methods in the context of a less realistic model in which all variables are 
normally distributed. For this model we are able to derive the interesting general result that the indirect 
method should always be better. 

Which method should we then use in practice? If we had to choose one method, our results seem to
imply that we should choose the indirect method, since it is more accurate. The simulation results
suggest, however, that the performance of the two methods is likely to be very close for the values of the
parameters appropriate for hurricanes in the real world. Given the possibility to use two methods we
would use both, as alternative points of view.

Ideally we would also be able to solve the more realistic model analytically, as we have done for the 
linear-normal case. We are working on that.

Figure 1: Atlantic basin and landfalling hurricane numbers for the period 1950 to 2005 (solid lines), with
the component of the variability that can be explained by SSTs (broken lines). 




Figure 2: One realisation of simulated basin and landfalling hurricane numbers (solid lines), with the 
SST driven components (broken lines). 




Figure 3: As in figure 2, but for a different realisation.




Figure 4: As in figure 2, but for a different realisation.




Figure 5: As in figure 2, but for a different realisation.